Information Theory

Table of Contents

1. Information Entropy

  • Shannon Entropy

1.1. Definition

For a random variable \(X\), which takes values in the alphabet \(\mathcal{X}\) and is distributed according to \(p: \mathcal{X} \to [0, 1]\), the entropy is \[ \operatorname{H}[X] := -\sum_{x\in\mathcal{X}}p(x)\log p(x), \] where the base of the logarithm may be \( 2 \), \( e \), or \( 10 \); the choice of base only fixes the unit (bits, nats, or hartleys, respectively).
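As a minimal numerical sketch of the definition (Python, standard library only; the helper name shannon_entropy is just for illustration), a fair coin carries exactly one bit of entropy, while a biased coin is less surprising on average:

  from math import log2

  def shannon_entropy(p):
      """Entropy in bits of a discrete distribution given as a list of probabilities."""
      return -sum(px * log2(px) for px in p if px > 0)

  print(shannon_entropy([0.5, 0.5]))  # 1.0 bit (fair coin)
  print(shannon_entropy([0.9, 0.1]))  # ~0.469 bits (biased coin)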

Note the similarity to Gibbs entropy.

1.2. Differential Entropy

  • It is not the limit of the Shannon entropy: as the bin width \( \Delta \to 0 \), the discretized entropy \( H^{\Delta} \) diverges, and only the shifted quantity \( H^{\Delta} + \log\Delta \) converges.
  • \[ h[f] := -\int_{-\infty}^\infty f(x)\log f(x)\,dx = \lim_{\Delta \to 0} \left( H^{\Delta} + \log \Delta \right) \]
  • \[ H^{\Delta} := -\sum_{i=-\infty}^{\infty} f(x_i) \Delta \log \left( f(x_i) \Delta \right) \]
  • Shannon simply assumed that this was the correct formula for the continuous case; the definition was later put on a firmer footing by the limiting density of discrete points (LDDP). A numerical illustration of the limit follows this list.
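A quick numerical check of the limit above (a Python sketch with NumPy; the grid range and step sizes are arbitrary choices), using a standard normal density whose differential entropy is \( \tfrac{1}{2}\log(2\pi e) \approx 1.4189 \) nats:

  import numpy as np

  # Discretize a standard normal density with bin width Delta and verify that
  # H_Delta + log(Delta) approaches the differential entropy ~1.4189 nats.
  def density(x):
      return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

  for delta in (0.5, 0.1, 0.01):
      x = np.arange(-10.0, 10.0, delta)
      p = density(x) * delta                 # bin probabilities f(x_i) * Delta
      H_delta = -np.sum(p * np.log(p))       # Shannon entropy of the discretized variable
      print(delta, H_delta + np.log(delta))  # tends to ~1.4189 as delta shrinks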

1.3. Thermodynamic Entropy

  • It is the special case in which the probability distribution has a physical meaning, e.g. a distribution over microstates.
  • Possession of Shannon information about the system (negentropy) reduces its entropy.
  • Conversion between information and free energy is possible, as quantified by the bound below.
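A concrete, well-known quantification of the last two points is Landauer's bound (quoted here for orientation, not derived in these notes): erasing one bit of information in an environment at temperature \( T \) dissipates at least

\begin{equation*} W \ge k_{\rm B} T \ln 2 \end{equation*}

of work, and conversely a Szilard engine can extract at most \( k_{\rm B} T \ln 2 \) of work per bit of information held about the system.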

1.4. Third Law of Thermodynamics

The fact that the entropy cannot be lower than zero puts a restriction on the stochastic system.

There is thus a minimal bound (zero) on the entropy of any configuration. In particular, since \( I(A;B) \le \min\{ H(\rho_A), H(\rho_B) \} \), a joint distribution with zero entropy forces the mutual information between its variables to vanish: variables that are not independent necessarily carry positive entropy.

2. Cross Entropy

2.1. Definition

\begin{equation*} H(p,q) = \int_{\mathbb{R}} p(x) \log \frac{1}{q(x)}\,dx. \end{equation*}

It is the expected surprise of an event drawn from the true distribution \( p(x) \), when the surprise is measured using a model (prior belief) distribution \( q(x) \).
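A small discrete sketch (Python; the distributions are arbitrary examples, and the integral becomes a sum): the cross-entropy is smallest when the model \( q \) matches \( p \), where it reduces to the entropy \( H(p) \).

  from math import log

  def cross_entropy(p, q):
      """H(p, q) in nats for discrete distributions given as probability lists."""
      return -sum(pi * log(qi) for pi, qi in zip(p, q) if pi > 0)

  p = [0.7, 0.2, 0.1]
  print(cross_entropy(p, p))                # H(p)          ~ 0.802 nats
  print(cross_entropy(p, [1/3, 1/3, 1/3]))  # H(p, uniform) = log 3 ~ 1.099 nats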

3. Kullback-Leibler Divergence

KL Divergence, Relative Entropy

3.1. Definition

The increase in expected surprise (entropy) incurred by believing in a wrong model \( q(x) \) instead of the true distribution \( p(x) \):

\begin{equation*} D_{\rm KL}(p\Vert q) := H(p,q) - H(p) = \int_{\mathbb{R}} p(x)\log \frac{p(x)}{q(x)}\,dx. \end{equation*}
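A self-contained numerical sketch of the definition (Python; same example distributions as the cross-entropy sketch above):

  from math import log

  def kl_divergence(p, q):
      """D_KL(p || q) in nats for discrete distributions given as probability lists."""
      return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

  p, q = [0.7, 0.2, 0.1], [1/3, 1/3, 1/3]
  print(kl_divergence(p, q))  # ~0.297 nats, matching H(p, q) - H(p) ~ 1.099 - 0.802
  print(kl_divergence(p, p))  # 0.0: no penalty for believing the true model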

3.2. Properties

  • KL divergence is nonnegative, and it attains its minimum, zero, exactly when the two distributions are equal.
    • This (Gibbs' inequality) can be shown with Jensen's inequality, as spelled out below.
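For completeness, the Jensen step referred to above: since \( -\log \) is convex,

\begin{equation*} D_{\rm KL}(p \Vert q) = \int_{\mathbb{R}} p(x) \left( -\log \frac{q(x)}{p(x)} \right) dx \;\ge\; -\log \int_{\mathbb{R}} p(x)\,\frac{q(x)}{p(x)}\,dx = -\log 1 = 0, \end{equation*}

with equality if and only if \( p = q \) almost everywhere (Gibbs' inequality).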

4. Mutual Information

4.1. Definition

For two random variables \( A, B \) with marginal distributions \( \rho_A, \rho_B \) and joint distribution \( \rho_{AB} \), \[ I(A;B) = \int_{AB}\rho_{AB}\log \frac{\rho_{AB}}{\rho_A\rho_B}\,dA\,dB. \]

In other words, \[ I(X;Y) := D_{\mathrm{KL}}( \rho_{XY} \Vert \rho_X\rho_Y). \]
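A discrete sketch (Python; the joint table is an arbitrary, correlated example) computing \( I(X;Y) \) directly as the KL divergence between the joint distribution and the product of its marginals:

  from math import log

  # Joint distribution p(x, y) as a 2x2 table (an arbitrary, correlated example).
  p_xy = [[0.4, 0.1],
          [0.1, 0.4]]
  p_x = [sum(row) for row in p_xy]        # marginal of X
  p_y = [sum(col) for col in zip(*p_xy)]  # marginal of Y

  mi = sum(p_xy[i][j] * log(p_xy[i][j] / (p_x[i] * p_y[j]))
           for i in range(2) for j in range(2) if p_xy[i][j] > 0)
  print(mi)  # ~0.193 nats; it would be 0 for an independent joint table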

4.2. Properties

  • \( H(\rho_{AB}) = H(\rho_A) + H(\rho_B) - I(A;B) \)
  • The mutual information is zero if and only if the two variables are independent.

5. Fisher Information

It is a measure of the amount of information that an observable random variable carries about an unknown parameter of its distribution.

5.1. Definition

Fisher information is defined to be the variance of the score:

\begin{equation*} \mathcal{I}(\theta) := \mathrm{Var}\left[ \left. \frac{\partial}{\partial \theta}\log f(X;\theta) \right| \theta \right] \end{equation*}

5.1.1. Score Function

The derivative of the log-likelihood with respect to the parameter, \( \frac{\partial}{\partial \theta}\log f(x;\theta) \).

5.2. Properties

The expected score is zero at the true parameter \( \theta^{*} \):

\begin{align*} \mathrm{E}\left[ \left. \frac{\partial}{\partial \theta} \log f(X;\theta) \right| \theta^{*} \right] &= \int_{\mathbb{R}} \left( \frac{\partial}{\partial \theta} \log f(x;\theta) \right)\Bigg|_{\theta = \theta^{*}} f(x;\theta^{*})\,dx \\ &= \int_{\mathbb{R}} \frac{\partial}{\partial \theta} f(x;\theta)\Bigg|_{\theta = \theta^{*}}\,dx = \frac{\partial}{\partial \theta} \int_{\mathbb{R}} f(x;\theta)\,dx \Bigg|_{\theta = \theta^{*}} = 0, \end{align*}

where \( f \) is the probability density function of \( X \); the second equality uses \( \left( \partial_\theta \log f \right) f = \partial_\theta f \), and the last follows because \( \int_{\mathbb{R}} f(x;\theta)\,dx = 1 \) for every \( \theta \).

Under suitable regularity conditions, the variance of the score at the true parameter equals the negative of the expected derivative of the score:

\begin{align*} \mathrm{Var}\left[ \left. \frac{\partial}{\partial \theta}\log f(X;\theta) \right| \theta^{*} \right] &= \int_{\mathbb{R}}\left( \frac{\partial}{\partial \theta} \log f(x; \theta) \right)^2\Bigg|_{\theta = \theta^{*}} f(x; \theta^{*})\,dx \\ &= \int_{\mathbb{R}} \left[ \frac{\partial^2}{\partial \theta^2} f(x;\theta) - \left( \frac{\partial^2}{\partial \theta^2}\log f(x;\theta) \right) f(x;\theta) \right]\Bigg|_{\theta = \theta^{*}} dx \\ &= -\mathrm{E}\left[ \left. \frac{\partial^2}{\partial \theta^2} \log f(X;\theta) \right| \theta^{*} \right], \end{align*}

where the first line uses the fact that the expected score vanishes at \( \theta^{*} \), the second uses the identity \( \left( \partial_\theta \log f \right)^2 f = \partial^2_\theta f - \left( \partial^2_\theta \log f \right) f \), and the last uses \( \int_{\mathbb{R}} \partial^2_\theta f(x;\theta)\,dx = \partial^2_\theta \int_{\mathbb{R}} f(x;\theta)\,dx = 0 \).
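A numerical sanity check of this identity (a Python sketch; the Bernoulli model is an arbitrary choice, with closed form \( \mathcal{I}(\theta) = 1/\bigl(\theta(1-\theta)\bigr) \)):

  theta = 0.3                                  # true parameter of a Bernoulli(theta)
  p = {1: theta, 0: 1 - theta}

  # Score and its theta-derivative for f(x; theta) = theta^x (1 - theta)^(1 - x).
  score = {1: 1 / theta, 0: -1 / (1 - theta)}
  dscore = {1: -1 / theta**2, 0: -1 / (1 - theta)**2}

  var_score = sum(p[x] * score[x]**2 for x in p)      # E[score] = 0, so this is the variance
  neg_exp_dscore = -sum(p[x] * dscore[x] for x in p)
  print(var_score, neg_exp_dscore, 1 / (theta * (1 - theta)))  # all ~4.7619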

5.3. Intuition

Let the cross-entropy \( H(\theta) \) be

\begin{equation*} H(\theta) = -\int_{\mathbb{R}} f(x; \theta^{*}) \log L(\theta | x)\,dx \end{equation*}

where \( L(\theta | x) = f(x; \theta)\) is the likelihood.

The expected score is the negative of the derivative of the cross-entropy:

\begin{align*} - \mathrm{E}\left[ \left. \frac{\partial}{\partial \theta} \log f(X;\theta) \right| \theta \right] &= - \int_{\mathbb{R}} \left(\frac{\partial}{\partial \theta} \log f(x;\theta)\right) f(x;\theta^{*})\,dx \\ &= -\frac{\partial}{\partial \theta} \int_{\mathbb{R}} f(x;\theta^{*}) \log L(\theta|x) \, dx\\ &= \frac{\partial}{\partial \theta}H(\theta). \end{align*}

Similarly, the negative of the expected derivative of the score at the true parameter is the second derivative of the cross-entropy:

\begin{align*} \mathcal{I}(\theta^{*}) &= -\mathrm{E}\left[ \left. \frac{\partial^2}{\partial \theta^2} \log f(X;\theta) \right| \theta^{*} \right] \\ &= \frac{\partial^2}{\partial \theta^2}H(\theta) \Bigg|_{\theta = \theta^{*}}. \end{align*}

This means that the cross-entropy is minimized at the true parameter (its first derivative vanishes there), and that the variance of the score equals the curvature of the cross-entropy at that minimum.

5.4. Cramér-Rao Bound

The reciprocal of the Fisher information is a lower bound on the variance of any unbiased estimator of \( \theta \):

\begin{equation*} \mathrm{Var}\left[ \hat{\theta}(X) \right] \ge \frac{1}{\mathcal{I}(\theta)}. \end{equation*}

In particular, the maximum likelihood estimator \( \hat{\theta}_{\rm MLE} \) asymptotically attains this bound: its variance approaches the reciprocal of the Fisher information, \( \mathcal{I}(\theta)^{-1} \), the lowest possible value.
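A standard worked example (not in the original notes): for \( n \) i.i.d. samples from \( \mathcal{N}(\mu, \sigma^2) \) with \( \sigma \) known, the Fisher information about \( \mu \) is \( \mathcal{I}(\mu) = n/\sigma^2 \), and the sample mean saturates the bound:

\begin{equation*} \mathrm{Var}\left[ \bar{X} \right] = \frac{\sigma^2}{n} = \frac{1}{\mathcal{I}(\mu)}. \end{equation*}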

5.4.1. Derivation

The inequality can be obtained by differentiating the unbiasedness condition \( \mathrm{E}[\hat{\theta}(X)] = \theta \) with respect to \( \theta \) and applying the Cauchy-Schwarz inequality.
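A compressed version of that argument (assuming \( \hat{\theta} \) is unbiased and that differentiation under the integral sign is permitted): differentiating \( \int_{\mathbb{R}} \hat{\theta}(x) f(x;\theta)\,dx = \theta \) with respect to \( \theta \) gives

\begin{equation*} \int_{\mathbb{R}} \hat{\theta}(x)\, \frac{\partial f(x;\theta)}{\partial \theta}\,dx = \mathrm{E}\left[ \hat{\theta}(X)\, \frac{\partial}{\partial \theta} \log f(X;\theta) \right] = \mathrm{Cov}\left[ \hat{\theta}(X),\ \frac{\partial}{\partial \theta} \log f(X;\theta) \right] = 1, \end{equation*}

where the covariance appears because the expected score is zero. The Cauchy-Schwarz inequality then yields \( 1 \le \mathrm{Var}[\hat{\theta}(X)]\, \mathcal{I}(\theta) \).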

5.5. Fisher Information Matrix

It is defined as the covariance matrix of the score vector, i.e. of the gradient of the log-likelihood with respect to the parameters:

\begin{equation*} \mathcal{I}(\boldsymbol{\theta})_{ij} = \mathrm{Cov}\left[\left.\frac{\partial }{\partial \theta_i}\log f(X| \boldsymbol{\theta}), \frac{\partial }{\partial \theta_j}\log f(X| \boldsymbol{\theta}) \right| \boldsymbol{\theta}\right] \end{equation*}

Under certain regularity conditions, the Fisher information matrix at the true parameter is given by the Hessian of the cross-entropy (equivalently, of the relative entropy, since the two differ only by a constant in \( \boldsymbol{\theta} \)):

\begin{align*} \mathcal{I}(\boldsymbol{\theta^{*}})_{ij} &= -\mathrm{E}\left[\left.\frac{\partial^2 }{\partial \theta_i\partial\theta_j}\log f(X| \boldsymbol{\theta}) \right| \boldsymbol{\theta^{*}}\right] \\ &= \frac{\partial^2}{\partial\theta_i\partial\theta_j}H(\boldsymbol{\theta})\bigg|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{*}}. \end{align*}
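A standard closed-form example (stated, not derived here): for a normal model \( f(x \mid \mu, \sigma) = \mathcal{N}(\mu, \sigma^2) \) with \( \boldsymbol{\theta} = (\mu, \sigma) \),

\begin{equation*} \mathcal{I}(\mu, \sigma) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}, \end{equation*}

so the off-diagonal entries vanish and the two parameters are informationally orthogonal.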


Created: 2025-06-19 Thu 16:21